import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import plotly.express as px
import warnings
warnings.filterwarnings('ignore')
The objective of this project was to analyse some grocery data and conduct market basket analysis. Market basket analysis is a type of association analysis that aims to find associations/correlations between items/groceries purchased by customers. Such analysis is conducted so that certain products can be recommended once other products have already been purchased/picked by a customer. In my analysis I used Python's MLxtend library, which uses the apriori algorithm to conduct association analysis, to find associations between groceries purchased by members of a supermarket. The data was sourced from Kaggle at https://www.kaggle.com/datasets/heeraldedhia/groceries-dataset. I found that there were no concrete, significant relationships between the groceries being purchased. Certain products (X) did show a considerable chance of influencing the purchase of another product (Y), but those products were not purchased together often enough to make their potential relationship worth acting on.
This lack of observed associations could be due to the fact that about 10,000 out of roughly 15,000 transactions included only 2 items, which may have been purchased without any relation to each other. Perhaps with more data in general, and more transactions containing more products, more significant associations could have been found. Regardless, this notebook documents my attempts to explore associations and the procedure used to arrive at my conclusions.
df = pd.read_csv("groceries_data.csv")
df
| Member_number | Date | itemDescription | |
|---|---|---|---|
| 0 | 1808 | 21-07-2015 | tropical fruit |
| 1 | 2552 | 05-01-2015 | whole milk |
| 2 | 2300 | 19-09-2015 | pip fruit |
| 3 | 1187 | 12-12-2015 | other vegetables |
| 4 | 3037 | 01-02-2015 | whole milk |
| ... | ... | ... | ... |
| 38760 | 4471 | 08-10-2014 | sliced cheese |
| 38761 | 2022 | 23-02-2014 | candy |
| 38762 | 1097 | 16-04-2014 | cake bar |
| 38763 | 1510 | 03-12-2014 | fruit/vegetable juice |
| 38764 | 1521 | 26-12-2014 | cat food |
38765 rows × 3 columns
Renaming the columns to a consistent snake_case style:
df = df.rename(columns = {"Member_number": "member_number", "Date": "date", "itemDescription": "item_desc"})
Ensuring that the data type of each column is suitable:
df.dtypes
member_number int64 date object item_desc object dtype: object
member_number is a categorical variable in this data rather than a quantitative one: it is not unique, and it will be used to group rows together to find the purchases of a single member on a particular day. It is therefore better stored as a string (object) than as an integer:
df["member_number"] = df["member_number"].astype('str')
The date column should also be of type datetime, and we can add year and month columns. The dates are written day-first (e.g. 21-07-2015), so dayfirst=True is passed to prevent ambiguous dates such as 05-01-2015 from being parsed month-first:
df["date"] = pd.to_datetime(df["date"], dayfirst=True)
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df
| member_number | date | item_desc | year | month | |
|---|---|---|---|---|---|
| 0 | 1808 | 2015-07-21 | tropical fruit | 2015 | 7 |
| 1 | 2552 | 2015-05-01 | whole milk | 2015 | 5 |
| 2 | 2300 | 2015-09-19 | pip fruit | 2015 | 9 |
| 3 | 1187 | 2015-12-12 | other vegetables | 2015 | 12 |
| 4 | 3037 | 2015-01-02 | whole milk | 2015 | 1 |
| ... | ... | ... | ... | ... | ... |
| 38760 | 4471 | 2014-08-10 | sliced cheese | 2014 | 8 |
| 38761 | 2022 | 2014-02-23 | candy | 2014 | 2 |
| 38762 | 1097 | 2014-04-16 | cake bar | 2014 | 4 |
| 38763 | 1510 | 2014-03-12 | fruit/vegetable juice | 2014 | 3 |
| 38764 | 1521 | 2014-12-26 | cat food | 2014 | 12 |
38765 rows × 5 columns
df.dtypes
member_number object date datetime64[ns] item_desc object year int64 month int64 dtype: object
Checking for any null/NA values (False means none were found):
df.isnull().values.any()
False
popular_df = df.groupby("item_desc").count().reset_index()
popular_df = popular_df.sort_values(by="member_number", ascending=False).head(10)
The bar chart below shows that whole milk is the most popular item purchased, followed by other vegetables, then rolls/buns, then soda, then tropical fruit and so on.
fig = px.bar(popular_df, x = "item_desc", y = "member_number", title = "Top 10 items purchased", labels = {"item_desc": "item name", "member_number":"number of times purchased"})
fig.show()
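The same top-items table can be computed more directly with value_counts, which counts occurrences and sorts in descending order. A minimal sketch on a hypothetical mini-dataset (the rows here are illustrative, not from the Kaggle file):

```python
import pandas as pd

# Hypothetical mini-dataset standing in for the groceries data
mini = pd.DataFrame({
    "member_number": ["1", "2", "1", "3", "2", "1"],
    "item_desc": ["whole milk", "soda", "whole milk",
                  "rolls/buns", "whole milk", "soda"],
})

# value_counts counts each item and sorts descending, so head(10) is the top 10
top_items = mini["item_desc"].value_counts().head(10)
print(top_items)  # whole milk: 3, soda: 2, rolls/buns: 1
```

This avoids the groupby/count/sort chain and is usually the idiomatic choice for a single-column frequency table.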
regular_df = df.groupby("member_number").count().reset_index()
regular_df = regular_df.sort_values(by="date", ascending=False).head(10)
regular_df
| member_number | date | item_desc | |
|---|---|---|---|
| 2120 | 3180 | 36 | 36 |
| 2665 | 3737 | 33 | 33 |
| 1994 | 3050 | 33 | 33 |
| 1026 | 2051 | 33 | 33 |
| 2838 | 3915 | 31 | 31 |
| 1388 | 2433 | 31 | 31 |
| 1575 | 2625 | 31 | 31 |
| 1234 | 2271 | 31 | 31 |
| 2798 | 3872 | 30 | 30 |
| 3774 | 4875 | 29 | 29 |
fig = px.bar(regular_df, x = "member_number", y = "item_desc", title = "Top 10 regular customers", labels = {"member_number": "member number", "item_desc":"number of items purchased by member"})
fig.show()
monthly_df = df.groupby("month").count().reset_index()
# map month numbers (1-12) to names; this avoids the chained-indexing
# assignment (df["col"].loc[i] = ...) that triggers SettingWithCopyWarning
month_names = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", "December"]
monthly_df["month"] = monthly_df["month"].map(lambda m: month_names[m - 1])
monthly_df
| month | member_number | date | item_desc | year | |
|---|---|---|---|---|---|
| 0 | January | 3333 | 3333 | 3333 | 3333 |
| 1 | February | 3032 | 3032 | 3032 | 3032 |
| 2 | March | 3283 | 3283 | 3283 | 3283 |
| 3 | April | 3172 | 3172 | 3172 | 3172 |
| 4 | May | 3335 | 3335 | 3335 | 3335 |
| 5 | June | 3316 | 3316 | 3316 | 3316 |
| 6 | July | 3268 | 3268 | 3268 | 3268 |
| 7 | August | 3498 | 3498 | 3498 | 3498 |
| 8 | September | 2963 | 2963 | 2963 | 2963 |
| 9 | October | 3218 | 3218 | 3218 | 3218 |
| 10 | November | 3273 | 3273 | 3273 | 3273 |
| 11 | December | 3074 | 3074 | 3074 | 3074 |
fig = px.line(monthly_df, x="month", y="item_desc", title='items sold monthly', labels = {"item_desc":"number of items purchased"})
fig.show()
For the MLxtend library to properly find associations, each row of the data provided should be a transaction in one-hot encoded format: each row represents one transaction, with a 1 recorded for every item purchased in it and a 0 for every item that was not (scroll down to see this format). Below is the code to achieve such a format.
member_transactions = {}
# itertuples is much faster than repeated .iloc lookups in a while loop
for row in df.itertuples(index=False):
    key = (row.member_number, row.date)
    # collect every item bought by this member on this date
    member_transactions.setdefault(key, []).append(row.item_desc)
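The same member-and-date grouping can also be expressed with pandas groupby, avoiding the explicit loop entirely. A sketch on a small synthetic frame (the rows are illustrative):

```python
import pandas as pd

# Synthetic stand-in for the groceries data
mini = pd.DataFrame({
    "member_number": ["1808", "1808", "2552"],
    "date": pd.to_datetime(["2015-07-21", "2015-07-21", "2015-01-05"]),
    "item_desc": ["tropical fruit", "rolls/buns", "whole milk"],
})

# Group rows by (member, date) and collect each group's items into a list;
# the result is a Series indexed by (member_number, date)
grouped = mini.groupby(["member_number", "date"])["item_desc"].apply(list)
print(grouped.to_dict())
```

Each entry of `grouped` plays the same role as one entry of the member_transactions dictionary.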
The data is stored in a dictionary whose keys combine a member id with the date on which that member made a transaction. Each key maps to the list of items purchased in that transaction. Below I check the dictionary for member "1808" and cross-check against the source data:
# member_transactions[("1808", pd.Timestamp("2015-07-21"))]
# cross checking with original data to ensure that member_transactions has recorded correct information
df.set_index("member_number").loc["1808"].reset_index().set_index("date").loc["2015-07-21"]
| member_number | item_desc | year | month | |
|---|---|---|---|---|
| date | ||||
| 2015-07-21 | 1808 | tropical fruit | 2015 | 7 |
| 2015-07-21 | 1808 | rolls/buns | 2015 | 7 |
| 2015-07-21 | 1808 | candy | 2015 | 7 |
Next I will create a new dictionary which will be converted into a dataframe. Its keys are a transaction number plus one key per product (these become the columns of the data frame), and each value is a list of 1's and 0's (these lists become the rows of the data frame). The one-hot encoded format can be observed below:
new_df_columns = list(df["item_desc"].unique()) # unique item descriptions, which will become columns
new_df_dict = {"transaction_no": []}
# adding items as keys (future columns) to the new_df_dict (future data frame)
for column in new_df_columns:
new_df_dict[column] = []
i = 0
for key in member_transactions:
new_df_dict["transaction_no"].append(i)
items_list = member_transactions[key]
# If an item in items_list exists in new_df_dict.keys() then we assign a '1' for that item in new_df_dict
for item in new_df_dict.keys():
if item != "transaction_no":
if item in items_list:
new_df_dict[item].append(1)
else:
new_df_dict[item].append(0)
i += 1
transactions_df = pd.DataFrame(data=new_df_dict).set_index("transaction_no")
transactions_df
| tropical fruit | whole milk | pip fruit | other vegetables | rolls/buns | pot plants | citrus fruit | beef | frankfurter | chicken | ... | flower (seeds) | rice | tea | salad dressing | specialty vegetables | pudding powder | ready soups | make up remover | toilet cleaner | preservation products | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| transaction_no | |||||||||||||||||||||
| 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14958 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14959 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14960 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14961 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 14962 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
14963 rows × 167 columns
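The same one-hot encoding can be produced in a few lines with pd.crosstab, which cross-tabulates transactions against items; clipping at 1 guards against an item appearing twice in one transaction. A sketch on synthetic data (the transaction numbers and items are illustrative):

```python
import pandas as pd

# Synthetic long-format data: one row per (transaction, item) pair
mini = pd.DataFrame({
    "transaction_no": [0, 0, 1, 2],
    "item_desc": ["tropical fruit", "rolls/buns", "whole milk", "whole milk"],
})

# Rows = transactions, columns = items, values = purchase counts, clipped to 0/1
onehot = pd.crosstab(mini["transaction_no"], mini["item_desc"]).clip(upper=1)
print(onehot)
```

This replaces both the dictionary-of-lists step and the manual 0/1 loop, and scales better than per-item membership checks.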
Finding the most frequent groceries across transactions using the apriori algorithm. The most frequent item is whole milk, which appears in roughly 15% of transactions; the next most frequent are other vegetables, rolls/buns, soda and yogurt, which appear in 12% down to 8% of transactions. This is not promising for rule generation: associations form when items that each occur frequently also co-occur frequently, but no item here is purchased regularly, with whole milk appearing in only 15% of transactions and the second most frequent, other vegetables, in only 12%.
frequent_itemsets = apriori(transactions_df, min_support = 0.001, use_colnames=True)
frequent_itemsets.sort_values(by=["support"], ascending=False).head()
| support | itemsets | |
|---|---|---|
| 1 | 0.157923 | (whole milk) |
| 3 | 0.122101 | (other vegetables) |
| 4 | 0.110005 | (rolls/buns) |
| 37 | 0.097106 | (soda) |
| 17 | 0.085879 | (yogurt) |
sum(transactions_df["whole milk"]) # 2369 out of 14963 transactions 2369/14963 = 0.158
sum(transactions_df["other vegetables"]) # 1827
sum(transactions_df["rolls/buns"]) # 1646
sum(transactions_df["soda"]) # 1453
sum(transactions_df["yogurt"]) # 1285
1285
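The support values reported by apriori are simply column means of the one-hot matrix: the number of transactions containing the item divided by the total number of transactions. A small check on a synthetic one-hot matrix (the figures are illustrative):

```python
import pandas as pd

# Tiny hypothetical one-hot matrix: 4 transactions, 2 items
onehot = pd.DataFrame({
    "whole milk": [1, 0, 1, 0],
    "soda":       [1, 1, 0, 0],
})

# support(item) = transactions containing the item / total transactions
support_milk = onehot["whole milk"].sum() / len(onehot)
print(support_milk)  # -> 0.5 (2 of 4 transactions)
```

This is the same arithmetic as the sum() checks above, e.g. 2369/14963 = 0.158 for whole milk.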
Finding associations using MLxtend's association_rules() function:
rules = association_rules(frequent_itemsets, metric="lift")
rules
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (tropical fruit) | (rolls/buns) | 0.067767 | 0.110005 | 0.006082 | 0.089744 | 0.815816 | -0.001373 | 0.977741 |
| 1 | (rolls/buns) | (tropical fruit) | 0.110005 | 0.067767 | 0.006082 | 0.055286 | 0.815816 | -0.001373 | 0.986788 |
| 2 | (tropical fruit) | (fruit/vegetable juice) | 0.067767 | 0.034017 | 0.002139 | 0.031558 | 0.927711 | -0.000167 | 0.997461 |
| 3 | (fruit/vegetable juice) | (tropical fruit) | 0.034017 | 0.067767 | 0.002139 | 0.062868 | 0.927711 | -0.000167 | 0.994773 |
| 4 | (chocolate) | (tropical fruit) | 0.023592 | 0.067767 | 0.001403 | 0.059490 | 0.877860 | -0.000195 | 0.991199 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 729 | (soda, rolls/buns) | (other vegetables) | 0.008087 | 0.122101 | 0.001136 | 0.140496 | 1.150651 | 0.000149 | 1.021402 |
| 730 | (other vegetables, rolls/buns) | (soda) | 0.010559 | 0.097106 | 0.001136 | 0.107595 | 1.108013 | 0.000111 | 1.011753 |
| 731 | (soda) | (other vegetables, rolls/buns) | 0.097106 | 0.010559 | 0.001136 | 0.011700 | 1.108013 | 0.000111 | 1.001154 |
| 732 | (other vegetables) | (soda, rolls/buns) | 0.122101 | 0.008087 | 0.001136 | 0.009305 | 1.150651 | 0.000149 | 1.001230 |
| 733 | (rolls/buns) | (soda, other vegetables) | 0.110005 | 0.009691 | 0.001136 | 0.010328 | 1.065785 | 0.000070 | 1.000644 |
734 rows × 9 columns
Now that we have generated our associations, we can see that 734 rules have been produced. Only some of those are worth investigating: the rules with the highest support, high confidence and a lift greater than 1. We restrict attention to lifts greater than 1 because a lift above 1 implies the antecedent (X) and consequent (Y) are not independent, i.e. X has an effect on Y.
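Lift can be checked by hand from the supports: lift(X→Y) = support(X and Y) / (support(X) × support(Y)), so values above 1 mean X and Y co-occur more often than independence would predict. A small check on a synthetic one-hot matrix (the figures are illustrative):

```python
import pandas as pd

# Hypothetical one-hot matrix: 5 transactions, 2 items
onehot = pd.DataFrame({
    "sausage": [1, 1, 0, 0, 1],
    "soda":    [1, 0, 1, 0, 1],
})

support_x  = onehot["sausage"].mean()                     # P(X)      = 0.6
support_y  = onehot["soda"].mean()                        # P(Y)      = 0.6
support_xy = (onehot["sausage"] & onehot["soda"]).mean()  # P(X and Y) = 0.4

lift = support_xy / (support_x * support_y)  # 0.4 / 0.36 > 1, so not independent
print(lift)
```

Under independence, support_xy would equal support_x * support_y exactly, giving a lift of 1.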
The support of an association between an antecedent (X) and a consequent (Y) is, in essence, the probability of both X and Y occurring in a customer's transaction. Looking at the results below, filtered to lift greater than 1, only 240 rules remain, and their supports range from about 0.6% (top row) down to 0.1% (last row). In other words, each of these rules occurred in only 0.1% to 0.6% of transactions, so unfortunately the associations/rules may not be very significant or actionable. This suggests there are no strong relationships between products here, though that may well be due to the data itself; more data might lead to more concrete and substantial associations.
rules[(rules['lift'] >= 1)].sort_values(by="support", ascending=False)
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
|---|---|---|---|---|---|---|---|---|---|
| 502 | (sausage) | (soda) | 0.060349 | 0.097106 | 0.005948 | 0.098560 | 1.014975 | 0.000088 | 1.001613 |
| 503 | (soda) | (sausage) | 0.097106 | 0.060349 | 0.005948 | 0.061253 | 1.014975 | 0.000088 | 1.000963 |
| 453 | (sausage) | (yogurt) | 0.060349 | 0.085879 | 0.005748 | 0.095238 | 1.108986 | 0.000565 | 1.010345 |
| 452 | (yogurt) | (sausage) | 0.085879 | 0.060349 | 0.005748 | 0.066926 | 1.108986 | 0.000565 | 1.007049 |
| 167 | (other vegetables) | (frankfurter) | 0.122101 | 0.037760 | 0.005146 | 0.042146 | 1.116150 | 0.000536 | 1.004579 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 368 | (chicken) | (margarine) | 0.027869 | 0.032213 | 0.001002 | 0.035971 | 1.116675 | 0.000105 | 1.003899 |
| 349 | (cream cheese ) | (frankfurter) | 0.023658 | 0.037760 | 0.001002 | 0.042373 | 1.122169 | 0.000109 | 1.004817 |
| 348 | (frankfurter) | (cream cheese ) | 0.037760 | 0.023658 | 0.001002 | 0.026549 | 1.122169 | 0.000109 | 1.002969 |
| 301 | (citrus fruit) | (candy) | 0.053131 | 0.014369 | 0.001002 | 0.018868 | 1.313120 | 0.000239 | 1.004586 |
| 690 | (white bread) | (domestic eggs) | 0.023993 | 0.037091 | 0.001002 | 0.041783 | 1.126477 | 0.000113 | 1.004896 |
240 rows × 9 columns
Confidence is the probability of the consequent (Y) being in a transaction given that the antecedent (X) has already been purchased/picked by the customer. Confidence is useful for observing the strength of the relationship between products.
As all the associations have low support, they may not be significant. However, it is still worth inspecting the confidence of these rules to see whether certain products are 'related' even though the combination has a low chance of actually being purchased.
The results below, with lifts greater than 1, show that the confidences of the 240 rules range from 25% down to 0.6%. The topmost rule shows that customers who purchased yogurt and sausage together (an antecedent support of only about 0.6%) had a reasonable 25% chance of also purchasing whole milk (which has a 15% chance of being purchased in general, itself relatively low). Because the chance of a customer buying yogurt and sausage together is so low, the rule's support is reduced to about 0.15% even though whole milk's individual (consequent) support is a respectable 15%. With such a low support the association is not really actionable, but it is worth noting for future extensions of the data, which might raise this rule's support.
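Confidence can likewise be computed by hand: confidence(X→Y) = support(X and Y) / support(X), i.e. of the transactions containing X, the fraction that also contain Y. A small check on a synthetic one-hot matrix (the figures are illustrative):

```python
import pandas as pd

# Hypothetical one-hot matrix: 4 transactions, 2 items
onehot = pd.DataFrame({
    "yogurt":     [1, 1, 1, 0],
    "whole milk": [1, 0, 1, 1],
})

support_x  = onehot["yogurt"].mean()                         # P(X)       = 3/4
support_xy = (onehot["yogurt"] & onehot["whole milk"]).mean()  # P(X and Y) = 2/4

confidence = support_xy / support_x  # (2/4) / (3/4) = 2/3
print(confidence)
```

Note that confidence is not symmetric: confidence(Y→X) divides by support(Y) instead, which is why the rules table lists each pair in both directions.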
rules[(rules['lift'] >= 1)].sort_values(by="confidence", ascending=False)
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
|---|---|---|---|---|---|---|---|---|---|
| 717 | (sausage, yogurt) | (whole milk) | 0.005748 | 0.157923 | 0.001470 | 0.255814 | 1.619866 | 0.000563 | 1.131541 |
| 711 | (sausage, rolls/buns) | (whole milk) | 0.005347 | 0.157923 | 0.001136 | 0.212500 | 1.345594 | 0.000292 | 1.069304 |
| 722 | (sausage, soda) | (whole milk) | 0.005948 | 0.157923 | 0.001069 | 0.179775 | 1.138374 | 0.000130 | 1.026642 |
| 124 | (semi-finished bread) | (whole milk) | 0.009490 | 0.157923 | 0.001671 | 0.176056 | 1.114825 | 0.000172 | 1.022008 |
| 705 | (yogurt, rolls/buns) | (whole milk) | 0.007819 | 0.157923 | 0.001337 | 0.170940 | 1.082428 | 0.000102 | 1.015701 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 85 | (whole milk) | (detergent) | 0.157923 | 0.008621 | 0.001403 | 0.008887 | 1.030824 | 0.000042 | 1.000268 |
| 708 | (whole milk) | (yogurt, rolls/buns) | 0.157923 | 0.007819 | 0.001337 | 0.008464 | 1.082428 | 0.000102 | 1.000650 |
| 165 | (other vegetables) | (pot plants) | 0.122101 | 0.007819 | 0.001002 | 0.008210 | 1.049991 | 0.000048 | 1.000394 |
| 714 | (whole milk) | (sausage, rolls/buns) | 0.157923 | 0.005347 | 0.001136 | 0.007194 | 1.345594 | 0.000292 | 1.001861 |
| 727 | (whole milk) | (sausage, soda) | 0.157923 | 0.005948 | 0.001069 | 0.006771 | 1.138374 | 0.000130 | 1.000829 |
240 rows × 9 columns
Top 10 associations with high confidence
rules[(rules['lift'] >= 1)].sort_values(by="confidence", ascending=False).head(10)
| antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | |
|---|---|---|---|---|---|---|---|---|---|
| 717 | (sausage, yogurt) | (whole milk) | 0.005748 | 0.157923 | 0.001470 | 0.255814 | 1.619866 | 0.000563 | 1.131541 |
| 711 | (sausage, rolls/buns) | (whole milk) | 0.005347 | 0.157923 | 0.001136 | 0.212500 | 1.345594 | 0.000292 | 1.069304 |
| 722 | (sausage, soda) | (whole milk) | 0.005948 | 0.157923 | 0.001069 | 0.179775 | 1.138374 | 0.000130 | 1.026642 |
| 124 | (semi-finished bread) | (whole milk) | 0.009490 | 0.157923 | 0.001671 | 0.176056 | 1.114825 | 0.000172 | 1.022008 |
| 705 | (yogurt, rolls/buns) | (whole milk) | 0.007819 | 0.157923 | 0.001337 | 0.170940 | 1.082428 | 0.000102 | 1.015701 |
| 716 | (sausage, whole milk) | (yogurt) | 0.008955 | 0.085879 | 0.001470 | 0.164179 | 1.911760 | 0.000701 | 1.093681 |
| 84 | (detergent) | (whole milk) | 0.008621 | 0.157923 | 0.001403 | 0.162791 | 1.030824 | 0.000042 | 1.005814 |
| 79 | (ham) | (whole milk) | 0.017109 | 0.157923 | 0.002740 | 0.160156 | 1.014142 | 0.000038 | 1.002659 |
| 248 | (processed cheese) | (rolls/buns) | 0.010158 | 0.110005 | 0.001470 | 0.144737 | 1.315734 | 0.000353 | 1.040610 |
| 228 | (packaged fruit/vegetables) | (rolls/buns) | 0.008488 | 0.110005 | 0.001203 | 0.141732 | 1.288421 | 0.000269 | 1.036967 |
sample_df = transactions_df.sum(axis="columns")
sample_df = pd.DataFrame(sample_df)
sample_df = sample_df.reset_index().rename(columns={0:"sum"})
sample_df = sample_df.groupby("sum").count().reset_index()
sample_df
| sum | transaction_no | |
|---|---|---|
| 0 | 1 | 205 |
| 1 | 2 | 10012 |
| 2 | 3 | 2727 |
| 3 | 4 | 1273 |
| 4 | 5 | 338 |
| 5 | 6 | 179 |
| 6 | 7 | 113 |
| 7 | 8 | 96 |
| 8 | 9 | 19 |
| 9 | 10 | 1 |
As the bar plot below shows, about 10,000 of the roughly 15,000 transactions included only two items. Data with more items per transaction might have revealed more relationships, since common pairs or sets of items would have had more opportunities to appear.
fig = px.bar(sample_df, x = "sum", y = "transaction_no", title = "Number of items purchased in each transaction", labels = {"sum": "number of items purchased", "transaction_no":"number of transactions"})
fig.show()
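The basket-size distribution can also be read straight off the one-hot matrix: summing across columns gives the number of items in each transaction, and value_counts tallies how many transactions have each size. A sketch on synthetic data (the rows are illustrative):

```python
import pandas as pd

# Hypothetical one-hot matrix: 4 transactions, 3 items
onehot = pd.DataFrame({
    "whole milk": [1, 1, 0, 1],
    "soda":       [1, 0, 1, 0],
    "yogurt":     [0, 0, 1, 0],
})

basket_sizes = onehot.sum(axis="columns")            # items per transaction
size_counts = basket_sizes.value_counts().sort_index()  # transactions per size
print(size_counts)  # size 1: 2 transactions, size 2: 2 transactions
```

This replaces the reset_index/rename/groupby chain with a single value_counts call.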